Kolmogorov–Smirnov (KS) test (one-sample + two-sample)#

The Kolmogorov–Smirnov (KS) test is a nonparametric test about distributions:

  • One-sample KS: does one sample look like it came from a fully specified continuous distribution \(F_0\)?

  • Two-sample KS: do two samples look like they came from the same continuous distribution?

It does this by looking at the largest vertical gap between cumulative distribution functions (CDFs).


Learning goals#

After this notebook you should be able to:

  • explain what the KS statistic \(D\) measures (and what it does not measure)

  • compute \(D\) from scratch using only NumPy (no SciPy)

  • approximate p-values using Monte Carlo (one-sample) and permutations (two-sample)

  • interpret results in practice and spot common pitfalls

Prerequisites#

  • CDF vs PDF, and the idea of an empirical CDF (ECDF)

  • basic hypothesis testing (null, alternative, p-value)

import numpy as np

import plotly
import plotly.express as px
import plotly.graph_objects as go
import os
import plotly.io as pio
from plotly.subplots import make_subplots

pio.templates.default = "plotly_white"
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")

np.set_printoptions(precision=6, suppress=True)
rng = np.random.default_rng(42)

# Record versions for reproducibility (numerical details can vary across versions).
VERSIONS = {"numpy": np.__version__, "plotly": plotly.__version__}
VERSIONS
{'numpy': '1.26.2', 'plotly': '6.5.2'}

What the KS test is (and isn’t)#

The KS test is about the entire distribution, not a single summary like the mean.

  • If two distributions differ in location (mean/median), scale (variance), or shape (skew/heavy tails), KS can detect it.

  • The KS statistic \(D\) is an effect size: a maximum vertical distance between two CDF curves.

  • A small p-value means: “under the null hypothesis, a gap this large (or larger) would be rare.”

Typical uses:

  • Goodness-of-fit: Does a simulator/assumption match a known distribution \(F_0\)?

  • Distribution shift: Did a feature distribution change between two time periods?

  • A/B comparisons: Do two groups have different outcome distributions?

Key assumptions for the classic KS p-values:

  • observations are independent

  • the null distribution is continuous (ties or a discrete support change the calibration)

  • for the one-sample test, \(F_0\) is fully specified (if you estimate parameters from the same data, you need a correction like Lilliefors)

The core object: the CDF (and the ECDF)#

For a random variable \(X\), the cumulative distribution function (CDF) is

\[F(x) = \Pr(X \le x).\]

Given data \(x_1,\dots,x_n\), the empirical CDF (ECDF) is

\[\hat F_n(x) = \frac{1}{n} \sum_{i=1}^n \mathbf{1}\{x_i \le x\}.\]

The ECDF is a step function: it jumps by \(1/n\) at each observed value.

def ecdf_step(x):
    """Return sorted data and step-plot points for the ECDF."""
    x_sorted = np.sort(np.asarray(x))
    n = x_sorted.size
    if n == 0:
        raise ValueError("x must contain at least one observation.")

    # With line_shape="hv", repeating the first x gives the initial vertical jump from 0.
    x_step = np.r_[x_sorted[0], x_sorted]
    y_step = np.r_[0.0, np.arange(1, n + 1) / n]
    return x_sorted, x_step, y_step


def padded_linspace(x, n=500, pad=0.05):
    x = np.asarray(x)
    x_min = float(np.min(x))
    x_max = float(np.max(x))
    span = x_max - x_min
    if span == 0:
        span = 1.0
    return np.linspace(x_min - pad * span, x_max + pad * span, n)


def cdf_uniform_01(u):
    """CDF of Uniform(0,1) evaluated at u."""
    return np.clip(np.asarray(u), 0.0, 1.0)


def ks_1samp_statistic(sample, cdf):
    """One-sample KS statistic against a continuous, fully specified CDF.

    D is the max vertical gap between the ECDF and F0:
        D = sup_x |F_n(x) - F0(x)|

    For a sorted sample x_(1) <= ... <= x_(n), the maximum occurs at the order statistics.
    A convenient computation is via:
        D+ = max_i (i/n - F0(x_(i)))
        D- = max_i (F0(x_(i)) - (i-1)/n)
        D = max(D+, D-)
    """
    x = np.sort(np.asarray(sample))
    n = x.size
    if n == 0:
        raise ValueError("sample must contain at least one observation.")

    F = np.asarray(cdf(x), dtype=float)
    F = np.clip(F, 0.0, 1.0)

    i = np.arange(1, n + 1)
    d_plus_vals = i / n - F
    d_minus_vals = F - (i - 1) / n

    D_plus = float(np.max(d_plus_vals))
    D_minus = float(np.max(d_minus_vals))

    if D_plus >= D_minus:
        idx = int(np.argmax(d_plus_vals))
        direction = "+"  # ECDF is above F0 at the max gap
        y_emp = float(i[idx] / n)  # right-limit ECDF at x_(i)
        y_cdf = float(F[idx])
        D = D_plus
    else:
        idx = int(np.argmax(d_minus_vals))
        direction = "-"  # ECDF is below F0 at the max gap
        y_emp = float((i[idx] - 1) / n)  # left-limit ECDF at x_(i)
        y_cdf = float(F[idx])
        D = D_minus

    return {
        "D": float(D),
        "D_plus": D_plus,
        "D_minus": D_minus,
        "x_star": float(x[idx]),
        "y_emp": y_emp,
        "y_cdf": y_cdf,
        "direction": direction,
    }


def ks_2samp_statistic(x, y):
    """Two-sample KS statistic D = sup_x |F_n(x) - G_m(x)|."""
    x = np.sort(np.asarray(x))
    y = np.sort(np.asarray(y))
    n = x.size
    m = y.size
    if n == 0 or m == 0:
        raise ValueError("Both samples must contain at least one observation.")

    grid = np.sort(np.concatenate([x, y]))
    F_n = np.searchsorted(x, grid, side="right") / n
    G_m = np.searchsorted(y, grid, side="right") / m
    diff = np.abs(F_n - G_m)
    idx = int(np.argmax(diff))

    return {"D": float(diff[idx]), "x_star": float(grid[idx]), "F_n": float(F_n[idx]), "G_m": float(G_m[idx])}


def ks_pvalue_asymptotic_two_sided(D, n_eff, tol=1e-12, max_terms=10_000):
    """Asymptotic two-sided KS p-value via the Kolmogorov distribution.

    Uses a common Stephens-type finite-sample correction:
        lambda = (sqrt(n_eff) + 0.12 + 0.11/sqrt(n_eff)) * D

    This is great for intuition; for production, prefer a vetted library implementation.
    """
    D = float(D)
    n_eff = float(n_eff)
    if D <= 0:
        return 1.0
    if n_eff <= 0:
        raise ValueError("n_eff must be positive.")

    en = np.sqrt(n_eff)
    lam = (en + 0.12 + 0.11 / en) * D

    total = 0.0
    for k in range(1, int(max_terms) + 1):
        term = (-1.0) ** (k - 1) * np.exp(-2.0 * (k * k) * (lam * lam))
        total += term
        if abs(term) < tol:
            break

    p = 2.0 * total
    return float(np.clip(p, 0.0, 1.0))


def ks_1samp_null_ds(n, n_sim=5000, rng=None):
    """Simulate the null distribution of D for the one-sample KS test (continuous null)."""
    if rng is None:
        rng = np.random.default_rng()
    ds = np.empty(n_sim, dtype=float)
    for b in range(n_sim):
        u = rng.uniform(0.0, 1.0, size=n)
        ds[b] = ks_1samp_statistic(u, cdf_uniform_01)["D"]
    return ds


def mc_pvalue_right_tail(stat_obs, stats_null):
    """Add-one smoothed right-tail Monte Carlo p-value."""
    stats_null = np.asarray(stats_null)
    return float((np.sum(stats_null >= stat_obs) + 1.0) / (stats_null.size + 1.0))


def ks_2samp_pvalue_perm(x, y, n_perm=5000, rng=None):
    """Permutation p-value for the two-sample KS test (two-sided)."""
    if rng is None:
        rng = np.random.default_rng()

    x = np.asarray(x)
    y = np.asarray(y)
    n = x.size
    D_obs = ks_2samp_statistic(x, y)["D"]

    pooled = np.concatenate([x, y])
    ds = np.empty(n_perm, dtype=float)

    for b in range(n_perm):
        perm = rng.permutation(pooled.size)
        x_b = pooled[perm[:n]]
        y_b = pooled[perm[n:]]
        ds[b] = ks_2samp_statistic(x_b, y_b)["D"]

    return mc_pvalue_right_tail(D_obs, ds), ds

1) One-sample KS test (goodness-of-fit)#

Question: Does a sample look like it came from a given continuous distribution \(F_0\)?

Hypotheses (two-sided)#

  • \(H_0\): the data are i.i.d. from \(F_0\)

  • \(H_1\): the data are not from \(F_0\) (their CDF differs somewhere)

Test statistic#

\[D_n = \sup_x \left|\hat F_n(x) - F_0(x)\right|.\]

Intuition: imagine drawing the ECDF and the theoretical CDF. \(D_n\) is the largest vertical gap between those two curves.

Because the ECDF is a step function, you only need to check the order statistics \(x_{(1)} \le \dots \le x_{(n)}\):

\[D^+ = \max_i \left(\frac{i}{n} - F_0(x_{(i)})\right), \qquad D^- = \max_i \left(F_0(x_{(i)}) - \frac{i-1}{n}\right),\]

and \(D = \max(D^+, D^-)\).
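Before the full example, a hand-checkable micro-example may help. For three points tested against Uniform(0,1) we have \(F_0(x) = x\), so every quantity below can be verified on paper (this cell is a standalone sketch and does not use the helpers above):

```python
import numpy as np

# Three-point sample tested against Uniform(0,1); F0 is the identity on [0, 1].
x = np.array([0.1, 0.4, 0.7])
n = x.size
i = np.arange(1, n + 1)
F0 = x

d_plus = np.max(i / n - F0)         # largest term: 3/3 - 0.7 = 0.3
d_minus = np.max(F0 - (i - 1) / n)  # largest term: 0.1 - 0/3 = 0.1
D = max(d_plus, d_minus)            # D = 0.3: the ECDF overshoots F0 just after x = 0.7
print(d_plus, d_minus, D)
```

The maximum gap comes from \(D^+\) at \(x_{(3)} = 0.7\), where the ECDF has already reached 1 but \(F_0(0.7) = 0.7\).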

def cdf_exponential(x, rate):
    """CDF of Exp(rate) with support x>=0."""
    x = np.asarray(x)
    return np.where(x < 0.0, 0.0, 1.0 - np.exp(-rate * x))


# Synthetic data: true distribution is Exp(rate=1.0)
n = 80
rng_data = np.random.default_rng(1)
x = rng_data.exponential(scale=1.0, size=n)

models = {
    "rate=1.0 (correct)": 1.0,
    "rate=0.5 (too heavy tail)": 0.5,
}

# Under a continuous fully specified null, D is distribution-free, so we simulate it once.
n_sim = 5000
rng_null = np.random.default_rng(2)
ds_null_1samp = ks_1samp_null_ds(n=n, n_sim=n_sim, rng=rng_null)

one_sample_results = {}
for name, rate in models.items():
    res = ks_1samp_statistic(x, lambda t, r=rate: cdf_exponential(t, r))
    one_sample_results[name] = {
        **res,
        "p_mc": mc_pvalue_right_tail(res["D"], ds_null_1samp),
        "p_asym": ks_pvalue_asymptotic_two_sided(res["D"], n_eff=n),
    }

one_sample_results
{'rate=1.0 (correct)': {'D': 0.05514334949238031,
  'D_plus': 0.027218963900686455,
  'D_minus': 0.05514334949238031,
  'x_star': 0.498639097854449,
  'y_emp': 0.3375,
  'y_cdf': 0.39264334949238033,
  'direction': '-',
  'p_mc': 0.9604079184163168,
  'p_asym': 0.9636165170896432},
 'rate=0.5 (too heavy tail)': {'D': 0.2747444532695955,
  'D_plus': 0.2747444532695955,
  'D_minus': 0.0035513851998576484,
  'x_star': 1.1966034782973949,
  'y_emp': 0.725,
  'y_cdf': 0.45025554673040447,
  'direction': '+',
  'p_mc': 0.0001999600079984003,
  'p_asym': 7.934385383992118e-06}}
res_1samp = one_sample_results["rate=1.0 (correct)"]
rate = models["rate=1.0 (correct)"]

x_sorted, x_step, y_step = ecdf_step(x)
x_grid = padded_linspace(x, n=600)

cdf_grid = cdf_exponential(x_grid, rate=rate)
ecdf_grid = np.searchsorted(x_sorted, x_grid, side="right") / x_sorted.size
abs_gap = np.abs(ecdf_grid - cdf_grid)

fig = make_subplots(
    rows=2,
    cols=1,
    shared_xaxes=True,
    vertical_spacing=0.12,
    row_heights=[0.65, 0.35],
)

# Top panel: ECDF vs theoretical CDF
fig.add_trace(
    go.Scatter(x=x_step, y=y_step, mode="lines", name="ECDF", line_shape="hv"),
    row=1,
    col=1,
)
fig.add_trace(
    go.Scatter(x=x_grid, y=cdf_grid, mode="lines", name=f"Exp CDF (rate={rate})"),
    row=1,
    col=1,
)

# Highlight the max gap D at x_star
fig.add_trace(
    go.Scatter(
        x=[res_1samp["x_star"], res_1samp["x_star"]],
        y=[res_1samp["y_emp"], res_1samp["y_cdf"]],
        mode="lines",
        name=f"D = {res_1samp['D']:.3f}",
        line=dict(color="crimson", width=4),
    ),
    row=1,
    col=1,
)
fig.add_trace(
    go.Scatter(
        x=[res_1samp["x_star"], res_1samp["x_star"]],
        y=[res_1samp["y_emp"], res_1samp["y_cdf"]],
        mode="markers",
        showlegend=False,
        marker=dict(color="crimson", size=9),
    ),
    row=1,
    col=1,
)

# Bottom panel: absolute gap as a function of x
fig.add_trace(
    go.Scatter(x=x_grid, y=abs_gap, mode="lines", name="|ECDF - CDF|", line=dict(color="gray")),
    row=2,
    col=1,
)
fig.add_vline(x=res_1samp["x_star"], line=dict(color="crimson", dash="dot"))
fig.add_hline(y=res_1samp["D"], line=dict(color="crimson", dash="dash"), row=2, col=1)

fig.update_yaxes(title_text="CDF value", row=1, col=1, range=[0, 1])
fig.update_yaxes(title_text="Absolute gap", row=2, col=1)
fig.update_xaxes(title_text="x", row=2, col=1)

fig.update_layout(
    title="One-sample KS (Exp rate=1.0): ECDF vs CDF and the max gap D",
    height=720,
)
fig.show()
# Visualize the null distribution of D and compare the two observed statistics
D_crit_05 = float(np.quantile(ds_null_1samp, 0.95))

D_correct = one_sample_results["rate=1.0 (correct)"]["D"]
D_wrong = one_sample_results["rate=0.5 (too heavy tail)"]["D"]

fig = px.histogram(
    x=ds_null_1samp,
    nbins=60,
    title="One-sample KS: null distribution of D (simulated under H0)",
    labels={"x": "D"},
)
fig.add_vline(
    x=D_crit_05,
    line_color="black",
    line_dash="dash",
    annotation_text=f"~5% critical value: {D_crit_05:.3f}",
    annotation_position="top left",
)
fig.add_vline(
    x=D_correct,
    line_color="seagreen",
    annotation_text=f"Observed D (correct): {D_correct:.3f}<br>p≈{one_sample_results['rate=1.0 (correct)']['p_mc']:.3f}",
    annotation_position="top right",
)
fig.add_vline(
    x=D_wrong,
    line_color="crimson",
    annotation_text=f"Observed D (wrong): {D_wrong:.3f}<br>p≈{one_sample_results['rate=0.5 (too heavy tail)']['p_mc']:.3f}",
    annotation_position="top right",
)
fig.update_layout(showlegend=False)
fig.show()

How to interpret the one-sample KS test#

  • \(D\) is the largest absolute CDF difference. Example: \(D=0.12\) means there exists a threshold \(x\) where the empirical probability mass below \(x\) differs from \(F_0(x)\) by about 12 percentage points.

  • The p-value is about how surprising that gap is under \(H_0\).

  • Decision rule at significance level \(\alpha\):

    • reject \(H_0\) if p-value \(< \alpha\)

    • otherwise: you fail to reject (you don’t “prove” \(H_0\); you just lack evidence against it)

Also note the sign:

  • If the maximum comes from \(D^+\) (direction + in the helper), the ECDF is above \(F_0\) at \(x_\star\): the sample puts more mass below that threshold than \(F_0\) does.

  • If it comes from \(D^-\) (direction -), the ECDF is below \(F_0\) at \(x_\star\): the sample puts less mass below that threshold than \(F_0\) does.
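A quick synthetic check of this sign convention (a standalone sketch, independent of the examples above): draw a sample from Exp(rate=2) but test it against the Exp(rate=1) CDF. The sample is stochastically smaller, so its ECDF should run above \(F_0\) and the maximum gap should come from \(D^+\):

```python
import numpy as np

rng = np.random.default_rng(0)

# Exp(rate=2) sample tested against the Exp(rate=1) CDF: the sample concentrates
# on smaller values, so the ECDF sits above F0 and D+ should dominate.
x = np.sort(rng.exponential(scale=0.5, size=200))  # scale = 1/rate
n = x.size
F0 = 1.0 - np.exp(-x)  # Exp(1) CDF
i = np.arange(1, n + 1)
d_plus = np.max(i / n - F0)
d_minus = np.max(F0 - (i - 1) / n)
print(d_plus, d_minus, d_plus > d_minus)
```

The population gap \(\sup_x |(1 - e^{-2x}) - (1 - e^{-x})| = 0.25\) (attained at \(x = \ln 2\)) is much larger than the sampling noise at \(n = 200\), so \(D^+ > D^-\) is the expected outcome here.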

2) Two-sample KS test (distribution comparison)#

Question: Do two independent samples look like they came from the same continuous distribution?

Hypotheses (two-sided)#

  • \(H_0\): both samples are i.i.d. from the same distribution (their CDFs are equal)

  • \(H_1\): the distributions differ

Test statistic#

\[D_{n,m} = \sup_x \left|\hat F_n(x) - \hat G_m(x)\right|.\]

Same intuition: draw both ECDFs; \(D_{n,m}\) is the biggest vertical gap.

For p-values here we’ll use a permutation test:

  • under \(H_0\), the combined sample is exchangeable; labels “A” vs “B” are arbitrary

  • so we repeatedly shuffle labels and recompute \(D\)

  • the p-value is the fraction of shuffled datasets with \(D_{\text{perm}} \ge D_{\text{obs}}\) (the helper adds 1 to numerator and denominator so the estimate is never exactly zero)

# Synthetic two-sample example
n1, n2 = 80, 90

rng_a = np.random.default_rng(3)
rng_b = np.random.default_rng(4)

x_a = rng_a.normal(loc=0.0, scale=1.0, size=n1)
x_b = rng_b.normal(loc=0.4, scale=1.0, size=n2)  # shifted mean

res_2samp = ks_2samp_statistic(x_a, x_b)

n_perm = 4000
rng_perm = np.random.default_rng(5)
p_perm, ds_perm_2samp = ks_2samp_pvalue_perm(x_a, x_b, n_perm=n_perm, rng=rng_perm)

n_eff = n1 * n2 / (n1 + n2)
p_asym = ks_pvalue_asymptotic_two_sided(res_2samp["D"], n_eff=n_eff)

{**res_2samp, "p_perm": p_perm, "p_asym": p_asym}
{'D': 0.18472222222222223,
 'x_star': 0.026124833534033623,
 'F_n': 0.5625,
 'G_m': 0.37777777777777777,
 'p_perm': 0.08322919270182455,
 'p_asym': 0.09825246208122136}
# Plot both ECDFs and the location of the max gap
x_sorted_a, x_step_a, y_step_a = ecdf_step(x_a)
x_sorted_b, x_step_b, y_step_b = ecdf_step(x_b)

grid = padded_linspace(np.concatenate([x_a, x_b]), n=700)
Fa = np.searchsorted(x_sorted_a, grid, side="right") / x_sorted_a.size
Fb = np.searchsorted(x_sorted_b, grid, side="right") / x_sorted_b.size
abs_gap = np.abs(Fa - Fb)

x_star = res_2samp["x_star"]
Fa_star = res_2samp["F_n"]
Fb_star = res_2samp["G_m"]

fig = make_subplots(
    rows=2,
    cols=1,
    shared_xaxes=True,
    vertical_spacing=0.12,
    row_heights=[0.65, 0.35],
)

fig.add_trace(
    go.Scatter(x=x_step_a, y=y_step_a, mode="lines", name="ECDF A", line_shape="hv"),
    row=1,
    col=1,
)
fig.add_trace(
    go.Scatter(x=x_step_b, y=y_step_b, mode="lines", name="ECDF B", line_shape="hv"),
    row=1,
    col=1,
)

# Highlight max gap D
fig.add_trace(
    go.Scatter(
        x=[x_star, x_star],
        y=[Fa_star, Fb_star],
        mode="lines",
        name=f"D = {res_2samp['D']:.3f}",
        line=dict(color="crimson", width=4),
    ),
    row=1,
    col=1,
)
fig.add_trace(
    go.Scatter(
        x=[x_star, x_star],
        y=[Fa_star, Fb_star],
        mode="markers",
        showlegend=False,
        marker=dict(color="crimson", size=9),
    ),
    row=1,
    col=1,
)

fig.add_trace(
    go.Scatter(x=grid, y=abs_gap, mode="lines", name="|ECDF A - ECDF B|", line=dict(color="gray")),
    row=2,
    col=1,
)
fig.add_vline(x=x_star, line=dict(color="crimson", dash="dot"))
fig.add_hline(y=res_2samp["D"], line=dict(color="crimson", dash="dash"), row=2, col=1)

fig.update_yaxes(title_text="CDF value", row=1, col=1, range=[0, 1])
fig.update_yaxes(title_text="Absolute gap", row=2, col=1)
fig.update_xaxes(title_text="x", row=2, col=1)

fig.update_layout(
    title="Two-sample KS: ECDFs and the max gap D",
    height=720,
)
fig.show()
# Permutation null distribution for the two-sample test
D_crit_05 = float(np.quantile(ds_perm_2samp, 0.95))

fig = px.histogram(
    x=ds_perm_2samp,
    nbins=60,
    title="Two-sample KS: permutation null distribution of D",
    labels={"x": "D"},
)
fig.add_vline(
    x=D_crit_05,
    line_color="black",
    line_dash="dash",
    annotation_text=f"~5% critical value: {D_crit_05:.3f}",
    annotation_position="top left",
)
fig.add_vline(
    x=res_2samp["D"],
    line_color="crimson",
    annotation_text=f"Observed D: {res_2samp['D']:.3f}<br>p≈{p_perm:.3f}",
    annotation_position="top right",
)
fig.update_layout(showlegend=False)
fig.show()

What does a KS result mean?#

1) The statistic \(D\) (effect size)#

  • One-sample: \(D\) is the maximum difference between the sample ECDF and \(F_0\).

  • Two-sample: \(D\) is the maximum difference between the two ECDFs.

A useful way to say it in plain language:

There exists a threshold \(x\) such that the two distributions disagree about the probability of being \(\le x\) by about \(D\).

2) The p-value (evidence)#

  • p-value is computed under \(H_0\).

  • small p-value: the observed max gap is unlikely if \(H_0\) were true → evidence against \(H_0\).

  • large p-value: the observed gap is plausible under \(H_0\) → not enough evidence to reject.

3) What the test does not tell you#

  • It does not tell you why distributions differ (mean shift? variance? tails?)

  • It does not tell you whether the difference is practically important (that’s your domain call; \(D\) helps)

  • With very large samples, tiny differences can become statistically significant

Pitfalls and diagnostics#

  • Discrete data / ties: the classic KS calibration assumes continuity. With many ties (e.g., integer counts), p-values can be off (often conservative). Consider alternatives (e.g., chi-square / exact tests) or permutation/bootstrap calibration.

  • Estimated parameters (one-sample): if you fit parameters using the same sample (e.g., test normality after fitting mean/std), the standard KS p-value is not valid. Look into Lilliefors-type corrections or use a parametric bootstrap.

  • Sensitivity profile: KS is typically more sensitive near the center of the distribution than in the far tails. If tails matter, consider Anderson–Darling or Cramér–von Mises.

  • Independence: dependence (time series, clustered data) breaks the null distribution. Consider block bootstrap / time-series aware tests.

  • Multiple testing: if you test many features, control false positives (Bonferroni, BH/FDR, etc.).
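As a sketch of the estimated-parameter fix, here is a parametric bootstrap in the spirit of Lilliefors, assuming an exponential model. The helper `ks_D_exp` re-derives the one-sample statistic so the cell stands alone; the key step is refitting the rate on every bootstrap replicate:

```python
import numpy as np

rng = np.random.default_rng(0)

def ks_D_exp(sample, rate):
    """One-sample KS statistic against Exp(rate), via the order-statistic formula."""
    x = np.sort(sample)
    n = x.size
    F = 1.0 - np.exp(-rate * x)
    i = np.arange(1, n + 1)
    return max(float(np.max(i / n - F)), float(np.max(F - (i - 1) / n)))

x = rng.exponential(scale=1.0, size=80)
rate_hat = 1.0 / x.mean()  # MLE fitted on the *same* data
D_obs = ks_D_exp(x, rate_hat)

# Parametric bootstrap: simulate from the fitted model, REFIT on each replicate,
# recompute D. Refitting inside the loop is what restores the calibration that
# the naive KS p-value loses when parameters are estimated from the data.
B = 2000
ds = np.empty(B)
for b in range(B):
    xb = rng.exponential(scale=1.0 / rate_hat, size=x.size)
    ds[b] = ks_D_exp(xb, 1.0 / xb.mean())

p_boot = float((np.sum(ds >= D_obs) + 1.0) / (B + 1.0))
print(round(D_obs, 4), round(p_boot, 3))
```

Comparing `p_boot` with the naive p-value from `ks_pvalue_asymptotic_two_sided` on the fitted model is exactly Exercise 3 below.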

Exercises#

  1. Modify the two-sample example so that the groups differ only in variance, not mean. How does \(D\) change?

  2. Simulate discrete data (e.g., Poisson) and see how ties affect the ECDF plot and the test calibration.

  3. For the one-sample setting, estimate the rate parameter from the sample (MLE) and compare:

    • naive KS p-value (treating the fitted distribution as fixed)

    • parametric bootstrap p-value (simulate from fitted distribution, refit, recompute D)

  4. Implement a one-sided KS test using \(D^+\) or \(D^-\) and interpret what each side means.

References#

  • Kolmogorov–Smirnov test (general overview): https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test

  • Stephens, M. A. (1970). Use of the Kolmogorov–Smirnov, Cramér–von Mises and related statistics without extensive tables.

  • For parameter-estimation caveat (Lilliefors test): https://en.wikipedia.org/wiki/Lilliefors_test